Partitioning Regular Applications for Cache-coherent Multiprocessors
Authors
Abstract
In all massively parallel systems (MPPs), whether message-passing or shared-address space, the memory is physically distributed for scalability and the latency of accessing remote data is orders of magnitude higher than the processor cycle time. Therefore, the programmer/compiler must not only identify parallelism but also specify the distribution of data among the processor memories in order to obtain reasonable efficiency. Shared-address MPPs provide an easier paradigm for programmers than message-passing systems since the communication is automatically handled by the hardware and/or operating system. However, it is just as important to optimize the communication in shared-address systems if high performance is to be achieved. Since communication is implied by the data layout and data reference pattern of the application, the data layout scheme and data access pattern must be controlled by the compiler in order to optimize communication. Machine-specific parameters, such as cache size and cache line size, describing the memory hierarchy of the shared-address space machine must be used to tailor the optimization of the application to the memory hierarchy of the MPP. This report focuses on a partitioning methodology to optimize application performance on cache-coherent multiprocessors. We give an algorithm for choosing block-cyclic partitions for scientific programs with regular data structures such as dense linear algebra applications and PDE solvers. We provide algorithms to compute the cache state on exiting a parallel region given the cache state on entry, and methods to compute the overall cache-coherency traffic and choose block-cyclic parameters to optimize cache-coherency traffic. Our approach is demonstrated on two applications. We show that the optimal partition computed by our algorithm matches the experimentally observed optimum and we show the effect of cache line size on partition performance.
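As a rough illustration of why block-cyclic parameters and cache line size interact, the sketch below maps the elements of a one-dimensional array to processors under a block-cyclic layout and counts the cache lines that hold elements owned by more than one processor. It is not the algorithm from the report: the function names, the one-dimensional layout, and the use of shared cache lines as a stand-in for coherency traffic are all assumptions made for this example.

```python
def block_cyclic_owner(i, block, nprocs):
    """Processor that owns element i when blocks of `block` consecutive
    elements are dealt out to `nprocs` processors in round-robin order.
    (Hypothetical helper, not taken from the report.)"""
    return (i // block) % nprocs


def shared_cache_lines(n, block, nprocs, line_elems):
    """Count cache lines whose elements belong to more than one processor.

    Assumes the array starts on a cache-line boundary and that each line
    holds `line_elems` array elements. A line with several owners is a
    candidate for coherency traffic when those owners write to it.
    """
    shared = 0
    for start in range(0, n, line_elems):
        end = min(start + line_elems, n)
        owners = {block_cyclic_owner(i, block, nprocs) for i in range(start, end)}
        if len(owners) > 1:
            shared += 1
    return shared


if __name__ == "__main__":
    # Illustrative parameters only: 64 elements, 4 processors, 8 elements per line.
    for block in (1, 4, 8, 12, 16):
        print(block, shared_cache_lines(64, block, 4, 8))
```

Under these assumptions, block sizes that are a multiple of the line size leave no line with more than one owner, while smaller or misaligned blocks split lines between processors. This is one simple way to see why the choice of partition depends on the cache line size.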
Similar resources
The Vantage Cache-partitioning Technique Enables Configurability and Quality-of-service Guarantees in Large-scale Chip Multiprocessors with Shared Caches. Caches Can Have Hundreds of Partitions with Sizes Specified at Cache Line Granularity, While Maintaining High Associativity and Strict Isolation among Partitions
Shared caches are pervasive in chip multiprocessors (CMPs). In particular, CMPs almost always feature a large, fully shared last-level cache (LLC) to mitigate the high latency, high energy, and limited bandwidth of main memory. A shared LLC has several advantages over multiple, private LLCs: it increases cache utilization, accelerates intercore communication (which happens through the cac...
Communication-Minimal Partitioning of Parallel Loops and Data Arrays for Cache-Coherent Distributed-Memory Multiprocessors
Harnessing the full performance potential of cache-coherent distributed shared memory multiprocessors without inordinate user effort requires a compilation technology that can automatically manage multiple levels of memory hierarchy. This paper describes a working compiler for such machines that automatically partitions loops and data arrays to optimize locality of access. The compiler implemen...
The Performance Advantages of Integrating Block Data Transfer in Cache-Coherent Multiprocessors
Integrating support for block data transfer has become an important emphasis in recent cache-coherent shared address space multiprocessors. This paper examines the potential performance benefits of adding this support. A set of ambitious hardware mechanisms is used to study performance gains in five important scientific computations that appear to be good candidates for using block transfer. Ou...
Execution Based Evaluation of Multistage Interconnection Networks for Cache-Coherent Multiprocessors
In this paper, the performance of multistage interconnection networks with wormhole routing and packet switching has been evaluated for cache-coherent shared-memory multiprocessors. The traffic in cache-coherent systems is characterized by traffic bursts, one-to-many and many-to-one traffic, and small fixed-length messages. The evaluation is based on execution-driven simulation using various applications. T...
Software Caching on Cache-Coherent Multiprocessors
Programmers have always been concerned with data distribution and remote memory access costs on shared-memory multiprocessors that lack coherent caches, like the BBN Butterfly. Recently memory latency has become an important issue on cache-coherent multiprocessors, where dramatic improvements in microprocessor performance have increased the relative cost of cache misses and coherency transaction...